Wine is a beverage made from fermented grape and other fruit juices with lower amount of alcohol content. Quality of wine is graded based on the taste of wine and vintage. This process is time taking, costly and not efficient. A wine itself includes different parameters like fixed acidity, volatile acidity, citric acid, residual sugar chlorides, free sulphur dioxide, total sulphur dioxide, density, pH, sulphates, alcohol and quality.
In industries, understanding the demands of wine safety testing can be a complex task for the laboratory th numerous analytes and residues to monitor But, our applications prediction, provide ideal solutions for the analysis of wine, which will make this whole process efficient and cheaper with less human interaction.
Our main objective is to predict the wine quality using machine learning through Python programming language
A large dataset is considered and wine quality modelled to analyse the quality of wine through different parameters like fixed acidity, volatile acidity etc. All these parameters will be analyse through Machine Leaming algorithms like random forest classifier algorithm which will helps to rate wine on scale 1-10 or bad-good. Output obtained would further be checked for correctness and model will be optimized accordingly.
It can support the wine expert evaluation and ultimately improve the production.
Citation Request: This dataset is public available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Title: Wine Quality
Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).
Relevant Information:
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are munch more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests): 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10)
Missing Attribute Values: None
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
import time
sns.set()
%matplotlib inline
# Read data set
data = pd.read_csv('QualityPrediction.csv')
data.head()
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
We have 11 features and one target.
Lets understand more about the data.
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1599 entries, 0 to 1598 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 fixed acidity 1599 non-null float64 1 volatile acidity 1599 non-null float64 2 citric acid 1599 non-null float64 3 residual sugar 1599 non-null float64 4 chlorides 1599 non-null float64 5 free sulfur dioxide 1599 non-null float64 6 total sulfur dioxide 1599 non-null float64 7 density 1599 non-null float64 8 pH 1599 non-null float64 9 sulphates 1599 non-null float64 10 alcohol 1599 non-null float64 11 quality 1599 non-null int64 dtypes: float64(11), int64(1) memory usage: 150.0 KB
We have total 1599 records with non Null and numeric 11 features and one target variable 'qulaity'
data.describe()
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.179060 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.330000 | 8.400000 | 3.000000 |
| 25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995600 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
| 50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 0.996750 | 3.310000 | 0.620000 | 10.200000 | 6.000000 |
| 75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 62.000000 | 0.997835 | 3.400000 | 0.730000 | 11.100000 | 6.000000 |
| max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 8.000000 |
Here its very clear that we have all numeric fields and none of them is having any missing values. max and mean value of the total sulfer dioxide is too high than other feature values so reuired to do feature scalling.
# Import pandas_profiling to get Initial EDA.
import pandas_profiling as pp
pp.ProfileReport(data)
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
As per above EDA we have 220 rows (13.8% of total data set) as duplicate rows. Duplicate rows will add unnecessary noise into our data so this will weaken our model, So we can remove these duplicates.
# Remove duplicate records and keep one record each.
data.drop_duplicates(keep='first')
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 5 | 7.4 | 0.660 | 0.00 | 1.8 | 0.075 | 13.0 | 40.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1593 | 6.8 | 0.620 | 0.08 | 1.9 | 0.068 | 28.0 | 38.0 | 0.99651 | 3.42 | 0.82 | 9.5 | 6 |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
| 1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
1359 rows × 12 columns
# get target values count
data.groupby('quality')['quality'].count()
quality 3 10 4 53 5 681 6 638 7 199 8 18 Name: quality, dtype: int64
# create a barplot to compare the counts of the quality
sns.countplot(data['quality'])
plt.title('Original Distribution of the Quality')
Text(0.5, 1.0, 'Original Distribution of the Quality')
Each expert graded the wine quality between 0 (very bad) to 10 (very excellent). We can split quality into two categories.
So let do it. Bad Quality (0) Good Quality (1)
# Mapping quality as 0 or 1
data['quality_cat'] = [1 if x > 5 else 0 for x in data.quality]
data.head()
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | quality_cat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | 0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | 0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
# create bar plot for new category with (0 and 1)
sns.countplot(data['quality_cat'])
plt.title('New Distribution of the Quality')
Text(0.5, 1.0, 'New Distribution of the Quality')
Quality_cat is looks like balanced data. Which is one of important requirement while dealing with classification problems.
During EDA we found some correlation between features. Lets see correlation using heatmap
# Plot heatmap to see if any features are highly correlated so we can remove it.
plt.figure(figsize=(15,10))
cor_plot = sns.heatmap(data.corr(), cmap="YlGnBu", annot=True, linewidth=.5)
plt.show()
Highest Correlation between features is 0.67 and -0.68. This correlation between 'fixed acidity' and 'pH','density', 'citric acid', so we can remove feature 'fixed acidity'. Another highest correlation 0.67 is between 'free sulfur dioxide' and 'total sulfur dioxide' so we can remove one of the feature
# Removed highly correlated feature 'fixed acidity'
data.drop(['fixed acidity'], axis = 1, inplace= True)
# Removed highly correlated feature 'free sulfur dioxide'
data.drop(['free sulfur dioxide'], axis = 1, inplace= True)
data.head()
| volatile acidity | citric acid | residual sugar | chlorides | total sulfur dioxide | density | pH | sulphates | alcohol | quality | quality_cat | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.70 | 0.00 | 1.9 | 0.076 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 0.88 | 0.00 | 2.6 | 0.098 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | 0 |
| 2 | 0.76 | 0.04 | 2.3 | 0.092 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | 0 |
| 3 | 0.28 | 0.56 | 1.9 | 0.075 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 4 | 0.70 | 0.00 | 1.9 | 0.076 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
# Correlation after removing feature 'fixed acidity'
plt.figure(figsize=(15,10))
cor_plot = sns.heatmap(data.corr(), cmap="YlGnBu", annot=True, linewidth=.5)
plt.show()
Now we are not having any highly correlated features.
# lets split the data into train and test dataset
features = data.iloc[:,:-2]
target = data['quality_cat']
Xtrain, Xtest, Ytrain, Ytest = train_test_split (features, target, test_size = 0.2, random_state = 10, shuffle =True )
Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Eucledian distance between two data points in their computations, this is a problem. If one feature is having lower values and other is having much higher values so the feature having higher values will dominate the model predictions so we need to scale all the features in one scale.
# Scaling the data
# Using StandardScaler
from sklearn.preprocessing import StandardScaler
SS = StandardScaler()
Xtrain_scaled = SS.fit_transform(Xtrain)
Xtest_scaled = SS.transform(Xtest)
Now we are ready with our train and test data lets import a classification algorithm and start model building.
# Import necessary packages
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
##Spot-Checking Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))
#testing models
results = []
names = []
for name, model in models:
kfold = KFold(n_splits=10, random_state=42, shuffle=True)
cv_results = cross_val_score(model, Xtrain_scaled, Ytrain, cv=kfold, scoring='roc_auc') # looking for roc auc value as metric
results.append(cv_results)
names.append(name)
model_performance = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
print(model_performance)
LR: 0.814435 (0.026608) LDA: 0.815038 (0.027111) KNN: 0.812069 (0.019961) CART: 0.746283 (0.035089) SVM: 0.836117 (0.023499) RF: 0.888111 (0.017494)
#Compare Algorithms
fig = plt.figure(figsize=(12,10))
plt.title('Comparison of Classification Algorithms')
plt.xlabel('Algorithm')
plt.ylabel('ROC-AUC Score')
plt.boxplot(results)
#ax = fig.add_subplot(111)
plt.show()
# Import support vector machine algorithm
from sklearn import svm
# kernal = rbf
model_svc = svm.SVC(kernel='rbf')
model_svc.fit(Xtrain_scaled, Ytrain)
Ypred = model_svc.predict(Xtest_scaled)
print('roc_auc_score Score:')
print(metrics.roc_auc_score(Ytest,Ypred))
roc_auc_score Score: 0.7408241476038087
# kernal = linear
model_svc = svm.SVC(kernel='linear')
model_svc.fit(Xtrain_scaled, Ytrain)
Ypred = model_svc.predict(Xtest_scaled)
print('roc_auc_score Score:')
print(metrics.roc_auc_score(Ytest,Ypred))
roc_auc_score Score: 0.7420291572833945
# kernal = poly
model_svc = svm.SVC(kernel='poly')
model_svc.fit(Xtrain_scaled, Ytrain)
Ypred = model_svc.predict(Xtest_scaled)
print('roc_auc_score Score:')
print(metrics.roc_auc_score(Ytest,Ypred))
roc_auc_score Score: 0.7292679072340089
# cross-validation import from sklearn
from sklearn.model_selection import cross_val_score
C_range = list(range(1,30))
roc_auc_score=[]
for c in C_range:
svc = svm.SVC(kernel='rbf',C=c)
scores = cross_val_score(svc,Xtrain_scaled,Ytrain,cv=10,scoring='roc_auc')
roc_auc_score.append(scores.mean())
print(roc_auc_score)
[0.8362592174134151, 0.8367765979648526, 0.8363567348887193, 0.8362332683256811, 0.8361111151091217, 0.8359614535806983, 0.8358145926586932, 0.8357905746100741, 0.8364301383631505, 0.8362333402898718, 0.8359641702288941, 0.8363826120122626, 0.8362127465373229, 0.8355264720275191, 0.8353571822661043, 0.83501396105298, 0.8349179908077741, 0.8346485388870498, 0.8341304147056425, 0.8335421794115249, 0.8331477436827435, 0.8326565640937069, 0.8325328336619698, 0.8320921909257952, 0.8321166167714944, 0.8319459896755375, 0.8313807169552432, 0.831453190892212, 0.8313786899638739]
plt.figure(figsize=(10,6))
C_values = list(range(1,30))
# plot C values in X-axis and cross_validate_accuracy on y-axis
plt.plot(C_values,roc_auc_score)
plt.xticks(np.arange(0,30,1))
plt.xlabel('Value of C for SVM')
plt.ylabel('Cross-Validate roc_auc_score')
Text(0, 0.5, 'Cross-Validate roc_auc_score')
Highest training score getting at C=2 and Kernal = 'rbf'
# Lets try with gamma
gamma_range = [0.0001,0.001,0.01,0.1,1,10,100]
roc_auc_score = []
for g in gamma_range:
svc = svm.SVC(kernel='rbf',gamma=g)
scores = cross_val_score(svc,Xtrain_scaled,Ytrain,cv=10,scoring='roc_auc')
roc_auc_score.append(scores.mean())
print(roc_auc_score)
[0.8064085251179014, 0.8110681344962746, 0.8201240362795474, 0.8366541628885467, 0.8452108670725448, 0.8418234436544376, 0.7717804510475588]
Highest score is getting at gamma = 1. Lets try some more values between 1 to 10.
# Lets try with gamma
gamma_range = [1,2,3,4,5,6,7,8,9,10]
roc_auc_score = []
for g in gamma_range:
svc = svm.SVC(kernel='rbf',gamma=g)
scores = cross_val_score(svc,Xtrain_scaled,Ytrain,cv=10,scoring='roc_auc')
roc_auc_score.append(scores.mean())
print(roc_auc_score)
[0.8452108670725448, 0.8555459563321293, 0.8576152926303872, 0.8537298740146904, 0.8508775853135478, 0.8478968135455798, 0.8448296247787102, 0.8431356477017037, 0.8416528465435599, 0.8418234436544376]
plt.figure(figsize=(10,6))
C_values = list(range(1,11))
# plot C values in X-axis and cross_validate_accuracy on y-axis
plt.plot(C_values,roc_auc_score)
plt.xticks(np.arange(0,11,1))
plt.xlabel('Value of gamma for SVM')
plt.ylabel('Cross-Validate roc_auc_score')
Text(0, 0.5, 'Cross-Validate roc_auc_score')
Highest training score at gamma = 3.
#Finding best parameters for our SVC model
from sklearn.model_selection import GridSearchCV
param = {
'C': [1,2,3,4,5],
'kernel':['linear', 'rbf'],
'gamma' :[1,2,3,4,5]
}
grid_svc = GridSearchCV(model_svc, param_grid=param, scoring='roc_auc', cv=10)
grid_svc.fit(Xtrain_scaled, Ytrain) # fit/train the model
GridSearchCV(cv=10, estimator=SVC(kernel='poly'),
param_grid={'C': [1, 2, 3, 4, 5], 'gamma': [1, 2, 3, 4, 5],
'kernel': ['linear', 'rbf']},
scoring='roc_auc')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. GridSearchCV(cv=10, estimator=SVC(kernel='poly'),
param_grid={'C': [1, 2, 3, 4, 5], 'gamma': [1, 2, 3, 4, 5],
'kernel': ['linear', 'rbf']},
scoring='roc_auc')SVC(kernel='poly')
SVC(kernel='poly')
#Best parameters for our svc model
grid_svc.best_params_
{'C': 1, 'gamma': 3, 'kernel': 'rbf'}
# create model with best parameters
model_svc_bp = svm.SVC(kernel='rbf',C=1, gamma = 3)
model_svc_bp.fit(Xtrain_scaled, Ytrain)
from sklearn.metrics import classification_report
Ypred = model_svc_bp.predict(Xtest_scaled)
print ('*'*10 + 'Test confusion metrics'+ '*'*10)
print(classification_report(Ytest, Ypred))
**********Test confusion metrics**********
precision recall f1-score support
0 0.79 0.57 0.66 143
1 0.72 0.88 0.79 177
accuracy 0.74 320
macro avg 0.75 0.72 0.73 320
weighted avg 0.75 0.74 0.73 320
# Random Forest Algorithm
rf = RandomForestClassifier()
kfold = KFold(n_splits=10, random_state=42, shuffle=True)
cv_results = cross_val_score(rf, Xtrain_scaled, Ytrain, cv=kfold, scoring='roc_auc')
msg = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
print(msg)
RF: 0.889027 (0.021100)
param_dist = {'max_depth': [18,19,20,21,22,25],
'bootstrap': [True, False],
'max_features': ['auto', 'sqrt', 'log2', None],
'criterion': ['gini', 'entropy']}
cv_rf = GridSearchCV(rf, cv = 10,
param_grid=param_dist,
n_jobs = 3)
cv_rf.fit(Xtrain_scaled, Ytrain)
print('Best Parameters using grid search: \n', cv_rf.best_params_)
Best Parameters using grid search:
{'bootstrap': True, 'criterion': 'entropy', 'max_depth': 21, 'max_features': 'auto'}
# build Model using Hyperparamaters
rf = RandomForestClassifier(criterion='entropy', bootstrap = True, max_depth = 21, max_features = 'auto' )
kfold = KFold(n_splits=10, random_state=42, shuffle=True)
cv_results = cross_val_score(rf, Xtrain_scaled, Ytrain, cv=kfold, scoring='roc_auc')
msg = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
print(msg)
RF: 0.891717 (0.019372)
Now we are getting ROC_AUC score = 0.891717 which is slightly higher than any other algorithm.
# Set best parameters given by grid search
rf.set_params(criterion='entropy', bootstrap = True, max_depth = 21, max_features = 'auto')
RandomForestClassifier(criterion='entropy', max_depth=21, max_features='auto')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(criterion='entropy', max_depth=21, max_features='auto')
# OOB rate
rf.set_params(warm_start=True,
oob_score=True)
min_estimators = 15
max_estimators = 1000
error_rate = {}
for i in range(min_estimators, max_estimators + 1):
rf.set_params(n_estimators=i)
rf.fit(Xtrain_scaled, Ytrain)
oob_error = 1 - rf.oob_score_
error_rate[i] = oob_error
# Convert dictionary to a pandas series for easy plotting
oob_series = pd.Series(error_rate)
fig, ax = plt.subplots(figsize=(10, 10))
ax.set_facecolor('#fafafa')
oob_series.plot(kind='line',color = 'red')
plt.axhline(0.055, color='#875FDB',linestyle='--')
plt.axhline(0.05, color='#875FDB',linestyle='--')
plt.xlabel('n_estimators')
plt.ylabel('OOB Error Rate')
plt.title('OOB Error Rate Across various Forest sizes \n(From 15 to 1000 trees)')
Text(0.5, 1.0, 'OOB Error Rate Across various Forest sizes \n(From 15 to 1000 trees)')
print('OOB Error rate for 200 trees is: {0:.5f}'.format(oob_series[220]))
OOB Error rate for 200 trees is: 0.17514
# Refine the tree via OOB Output
rf.set_params(criterion='entropy', bootstrap = True, max_depth = 21, max_features = 'auto', n_estimators=220, oob_score=False)
RandomForestClassifier(criterion='entropy', max_depth=21, max_features='auto',
n_estimators=220, warm_start=True)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. RandomForestClassifier(criterion='entropy', max_depth=21, max_features='auto',
n_estimators=220, warm_start=True)kfold = KFold(n_splits=10, random_state=42, shuffle=True)
cv_results = cross_val_score(rf, Xtrain_scaled, Ytrain, cv=kfold, scoring='roc_auc')
msg = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
print(msg)
RF: 0.890050 (0.018854)
accuracy_rf = rf.score(Xtest_scaled, Ytest)
print("Here is our accuracy on the test set:\n {0:.3f}"\
.format(accuracy_rf))
Here is our accuracy on the test set: 0.812
# Here we calculate the test error rate!
test_error_rate_rf = 1 - accuracy_rf
print("The test error rate for our model is:\n {0: .4f}"\
.format(test_error_rate_rf))
The test error rate for our model is: 0.1875
# visualize roc curve
from sklearn.metrics import RocCurveDisplay
disp_roc = RocCurveDisplay.from_estimator(rf, Xtest_scaled, Ytest)
plt.show()
#Confusion Metrics
Ypredict = rf.predict(Xtest_scaled)
cfs_metrics = confusion_matrix(Ytest, Ypredict, labels=rf.classes_)
print (cfs_metrics)
[[112 31] [ 29 148]]
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cfs_metrics, display_labels=rf.classes_)
disp.plot(cmap='GnBu')
plt.show()
print(metrics.classification_report(Ytest, Ypredict))
precision recall f1-score support
0 0.79 0.78 0.79 143
1 0.83 0.84 0.83 177
accuracy 0.81 320
macro avg 0.81 0.81 0.81 320
weighted avg 0.81 0.81 0.81 320
As per objective of the project our model can predict the given wine as good or bad with accuracy of 81%. As a business requirenment we are concern about both Precision and Recall so we can consider F1-score/ROC_AUC score for model evaluation. More false positives/lower precision would lead to the unnecessary production cost and decrease in the brand value. More false negative/lower recall will not allow to produce good wine. Model based on Random Forest algorithm gave highest F1-score/ROC_AUC score in training and testing. So we can use model based on random forest algorithm for predicting the wine quality. This will help the business to take better decisions with less intervention of the experts.